GH-50007: [C++][Parquet] Add bloom filter folding to automatically size SBBF filters by HuaHuaY · Pull Request #50008 · apache/arrow

HuaHuaY · 2026-05-21T10:18:59Z

Rationale for this change

This PR follows apache/arrow-rs#9628. It reduces the on-disk size of Bloom filters, so specifying an ndv larger than the actual number of distinct values no longer inflates disk usage.

Bloom filters now support folding mode: allocate a conservatively large filter (sized for worst-case NDV), insert all values during writing, then fold down at flush time to meet a target FPP. This eliminates the need to guess NDV upfront and produces optimally-sized filters automatically.
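A minimal sketch of the folding idea, not the code in this PR. It assumes the block index is derived from the upper 32 bits of the hash by multiply-shift, as in the Parquet split-block Bloom filter specification; under that assumption, halving the block count maps old blocks 2i and 2i+1 onto new block i, so a fold is a bitwise OR over groups of adjacent blocks. `ToyFilter` and all of its members are hypothetical names used only for illustration.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy split-block Bloom filter; NOT parquet::BlockSplitBloomFilter.
constexpr int kWordsPerBlock = 8;  // 8 x 32-bit words = one 32-byte block
using Block = std::array<uint32_t, kWordsPerBlock>;

// Salt constants from the Parquet split-block Bloom filter specification.
constexpr std::array<uint32_t, kWordsPerBlock> kSalt = {
    0x47b6137bU, 0x44974d91U, 0x8824ad5bU, 0xa2b7289dU,
    0x705495c7U, 0x2df1424bU, 0x9efc4947U, 0x5c6bfb31U};

struct ToyFilter {
  std::vector<Block> blocks;  // value-initialized, so all bits start at zero

  explicit ToyFilter(std::size_t num_blocks) : blocks(num_blocks) {}

  // Multiply-shift mapping of the upper 32 hash bits onto [0, num_blocks).
  std::size_t BlockIndex(uint64_t hash) const {
    const uint64_t n = static_cast<uint64_t>(blocks.size());
    return static_cast<std::size_t>(((hash >> 32) * n) >> 32);
  }

  // One bit per 32-bit word, chosen from the lower 32 hash bits.
  static Block Mask(uint64_t hash) {
    Block mask{};
    const uint32_t key = static_cast<uint32_t>(hash);
    for (int i = 0; i < kWordsPerBlock; ++i) {
      mask[i] = UINT32_C(1) << ((key * kSalt[i]) >> 27);
    }
    return mask;
  }

  void Insert(uint64_t hash) {
    Block& block = blocks[BlockIndex(hash)];
    const Block mask = Mask(hash);
    for (int i = 0; i < kWordsPerBlock; ++i) block[i] |= mask[i];
  }

  bool Find(uint64_t hash) const {
    const Block& block = blocks[BlockIndex(hash)];
    const Block mask = Mask(hash);
    for (int i = 0; i < kWordsPerBlock; ++i) {
      if ((block[i] & mask[i]) == 0) return false;
    }
    return true;
  }

  // Fold num_folds times: OR each group of 2^num_folds adjacent blocks into one.
  void Fold(uint32_t num_folds) {
    const std::size_t group_size = std::size_t{1} << num_folds;
    assert(blocks.size() % group_size == 0);
    const std::size_t new_num_blocks = blocks.size() / group_size;
    for (std::size_t i = 0; i < new_num_blocks; ++i) {
      Block merged{};
      for (std::size_t j = 0; j < group_size; ++j) {
        const Block& src = blocks[i * group_size + j];
        for (int w = 0; w < kWordsPerBlock; ++w) merged[w] |= src[w];
      }
      blocks[i] = merged;  // safe in place: later groups only read indices > i
    }
    blocks.resize(new_num_blocks);
  }
};
```

Because the index of a hash in the folded filter is its old index divided by the group size, every value inserted before `Fold` is still found afterwards; only the false-positive rate rises.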

What changes are included in this PR?

BloomFilterBuilder will try to fold the bloom filter before writing it to the output stream.

Are these changes tested?

Yes.

Are there any user-facing changes?

Yes.

The type of ndv in BloomFilterOptions is changed from int32_t to std::optional<int64_t>. The ndv parameter of OptimalNumOfBytes and OptimalNumOfBits in BlockSplitBloomFilter is changed from uint32_t to uint64_t.

GitHub Issue: [C++][Parquet] Add bloom filter folding to automatically size SBBF filters #50007

github-actions · 2026-05-21T10:20:34Z

⚠️ GitHub issue #50007 has been automatically assigned in GitHub to PR creator.

HuaHuaY · 2026-05-21T12:36:08Z

-      std::map</*column_id=*/int32_t, std::shared_ptr<BloomFilter>>;
+  struct RowGroupBloomFilters {
+    RowGroupBloomFilters() = default;
+    RowGroupBloomFilters(RowGroupBloomFilters&&) noexcept = default;


I need these to prevent MSVC from attempting to instantiate the copy constructor. See microsoft/STL#5552 and microsoft/STL#5084.

Or we just keep using std::shared_ptr instead of std::unique_ptr. Is it important here?

HuaHuaY · 2026-05-21T12:56:23Z

@wgtmac @alamb @etseidl @emkornfield Please take a look.

alamb · 2026-05-21T18:11:16Z

I am not likely to have time to review C++ code in the arrow repository unfortunately

wgtmac

Thanks @HuaHuaY for adding this quickly!

wgtmac · 2026-05-29T02:26:19Z

          std::to_string(bloom_filter_options.fpp));
    }
+    if (bloom_filter_options.ndv.has_value() && bloom_filter_options.ndv.value() < 0) {
+      throw ParquetException("Bloom filter number of distinct values must be >= 0, got " +


What is the expected behavior of 0?

It will create the smallest possible Bloom filter.
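To illustrate why ndv = 0 yields the smallest filter, here is a hedged sketch of validation followed by sizing. It is not the PR's code: `ValidateNdv` and `SketchNumBytes` are hypothetical names, the 32-byte minimum and 128 MiB maximum are assumed limits, and the sizing formula is the usual split-block estimate with 8 bits set per value.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>
#include <optional>
#include <stdexcept>
#include <string>

constexpr uint64_t kMinBytes = 32;                     // assumed: one 32-byte block
constexpr uint64_t kMaxBytes = 128ull * 1024 * 1024;   // assumed upper bound

// Reject negative ndv; an unset ndv is accepted and left to the caller.
inline void ValidateNdv(const std::optional<int64_t>& ndv) {
  if (ndv.has_value() && *ndv < 0) {
    throw std::invalid_argument(
        "Bloom filter number of distinct values must be >= 0, got " +
        std::to_string(*ndv));
  }
}

// Estimate the filter size in bytes for ndv values at the given FPP,
// clamped to [kMinBytes, kMaxBytes] and rounded up to a power of two.
inline uint64_t SketchNumBytes(uint64_t ndv, double fpp) {
  assert(fpp > 0.0 && fpp < 1.0);
  const double bits = -8.0 * static_cast<double>(ndv) /
                      std::log(1.0 - std::pow(fpp, 1.0 / 8.0));
  const double max_bits = static_cast<double>(kMaxBytes * 8);
  uint64_t num_bits =
      (bits >= max_bits) ? kMaxBytes * 8 : static_cast<uint64_t>(bits);
  num_bits = std::max(num_bits, kMinBytes * 8);
  uint64_t pow2 = kMinBytes * 8;
  while (pow2 < num_bits) pow2 <<= 1;  // round up to a power of two
  return pow2 / 8;
}
```

With ndv = 0 the estimate is zero bits, so the clamp to the assumed minimum applies and the result is a single block.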

wgtmac · 2026-05-29T03:53:17Z

cc @mapleFU @adamreeve

wgtmac

Generally LGTM. I left some nits.

mapleFU

Generally LGTM

HuaHuaY · 2026-06-02T11:17:12Z

@pitrou @mapleFU Please take a look.

pitrou · 2026-06-02T11:50:58Z

+  const double avg_fill = static_cast<double>(total_set_bits) /
+                          (static_cast<double>(num_blocks) * kBytesPerFilterBlock * 8);


More simply:

Suggested change

-  const double avg_fill = static_cast<double>(total_set_bits) /
-                          (static_cast<double>(num_blocks) * kBytesPerFilterBlock * 8);
+  const double avg_fill = static_cast<double>(total_set_bits) / (num_bytes_ * 8);

pitrou · 2026-06-02T11:54:51Z

+  DCHECK_GT(num_folds, 0);
+
+  const uint32_t num_blocks = NumBlocks();
+  const uint32_t group_size = UINT32_C(1) << num_folds;


Can you add comments? It's not obvious what a "group size" is.

pitrou · 2026-06-02T11:58:11Z

+    }
+  }
+
+  num_bytes_ = new_num_blocks * kBytesPerFilterBlock;


data_ is now oversized, would it be useful to shrink it here?

pitrou · 2026-06-02T11:59:36Z

+    }
+    ++num_folds;
+  }
+  return num_folds;


With this algorithm the actual size reduction will always be a power of 2 (group_size = UINT32_C(1) << num_folds). Why aren't we trying to be more granular?
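For context on the power-of-two granularity, a rough sketch of how a fold count could be chosen from the average fill. The model is an assumption, not taken from the PR: bits are treated as independent, so after k folds a bit is set with probability 1 - (1 - fill)^(2^k), and the false-positive rate is approximated as fill^8 because a lookup checks 8 bits. All function names are hypothetical. The trailing-zero loop plays the role of `std::countr_zero(num_blocks)` in the diff.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Predicted fraction of set bits after k folds, assuming independent bits.
inline double FillAfterFolds(double fill, uint32_t k) {
  return 1.0 - std::pow(1.0 - fill, static_cast<double>(UINT64_C(1) << k));
}

// Approximate FPP of a split-block filter: all 8 probed bits must be set.
inline double EstimatedFpp(double fill) { return std::pow(fill, 8.0); }

// Number of times num_blocks can be halved exactly (its trailing zero bits).
inline uint32_t MaxFolds(uint32_t num_blocks) {
  uint32_t n = 0;
  while (num_blocks > 1 && (num_blocks & 1u) == 0) {
    num_blocks >>= 1;
    ++n;
  }
  return n;
}

// Largest fold count whose estimated FPP stays at or below target_fpp.
inline uint32_t ChooseNumFolds(double avg_fill, uint32_t num_blocks,
                               double target_fpp) {
  assert(target_fpp > 0.0 && target_fpp < 1.0);
  const uint32_t max_folds = MaxFolds(num_blocks);
  if (avg_fill == 0.0) return max_folds;  // empty filter: fold all the way
  uint32_t num_folds = 0;
  while (num_folds < max_folds &&
         EstimatedFpp(FillAfterFolds(avg_fill, num_folds + 1)) <= target_fpp) {
    ++num_folds;
  }
  return num_folds;
}
```

Under this model each fold halves the filter, so the reachable sizes are the original size divided by powers of two, which is the granularity the comment above asks about.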

pitrou · 2026-06-02T12:02:02Z

-      std::map</*column_id=*/int32_t, std::shared_ptr<BloomFilter>>;
+  struct RowGroupBloomFilters {
+    RowGroupBloomFilters() = default;
+    RowGroupBloomFilters(RowGroupBloomFilters&&) noexcept = default;


Or we just keep using std::shared_ptr instead of std::unique_ptr. Is it important here?

pitrou · 2026-06-02T12:12:53Z

+            filter.GetBitsetSize());
+  for (uint64_t hash : hashes) {
+    EXPECT_TRUE(filter.FindHash(hash));
+  }


Should we check that most non-inserted values are not found, with an actual FPP value below kFpp?

mapleFU · 2026-06-02T13:59:55Z

+                          (static_cast<double>(num_blocks) * kBytesPerFilterBlock * 8);
+  const auto max_folds = static_cast<uint32_t>(std::countr_zero(num_blocks));
+
+  if (avg_fill == 0.0) {


I don't quite remember: would this really happen when writing a Parquet file?

mapleFU · 2026-06-02T14:16:53Z

+  const auto* bitset32 = reinterpret_cast<const uint32_t*>(data_->data());
+  const uint32_t num_words = num_bytes_ / static_cast<uint32_t>(sizeof(uint32_t));
+  for (uint32_t i = 0; i < num_words; ++i) {
+    total_set_bits += static_cast<uint64_t>(std::popcount(bitset32[i]));


I wonder whether internal::CountSetBits would be easier to understand here (though popcount is correct and a bit faster).
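As a standalone illustration of the computation discussed here, a small sketch of the average fill ratio over a bitset stored as 32-bit words. `AverageFill` is a hypothetical helper; it uses `std::bitset<32>::count()` in place of `std::popcount` so that it does not depend on C++20.

```cpp
#include <bitset>
#include <cassert>
#include <cstdint>
#include <vector>

// Fraction of set bits across all 32-bit words; 0.0 for an empty bitset.
inline double AverageFill(const std::vector<uint32_t>& words) {
  if (words.empty()) return 0.0;
  uint64_t total_set_bits = 0;
  for (uint32_t word : words) {
    total_set_bits += std::bitset<32>(word).count();
  }
  const uint64_t total_bits = static_cast<uint64_t>(words.size()) * 32;
  assert(total_set_bits <= total_bits);
  return static_cast<double>(total_set_bits) / static_cast<double>(total_bits);
}
```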

mapleFU · 2026-06-02T14:25:52Z

+    BloomFilterBuilder, BloomFilterBuilderFoldingTest,
+    ::testing::Values(BloomFilterBuilderFoldingTestCase{.ndv = 1'000'000,
+                                                        .fold = true,
+                                                        .inserted_count = 1000,


Can we add a test for max fold (when inserted count == 0)?

HuaHuaY requested a review from wgtmac as a code owner May 21, 2026 10:19

github-actions Bot added the awaiting review label May 21, 2026

github-actions Bot added Component: Parquet Component: C++ labels May 21, 2026

github-actions Bot added the awaiting committer review label and removed the awaiting review label May 21, 2026

HuaHuaY added 3 commits May 29, 2026 10:55

add bloom filter folding to automatically size SBBF filters

4347dca

fix ci

57019b0

fix windows ci

64294f2

wgtmac reviewed May 29, 2026

fix review

9565196

HuaHuaY force-pushed the sbbf_filters branch from 4495c53 to 9565196 Compare May 29, 2026 07:55

wgtmac approved these changes May 29, 2026

Comment thread cpp/src/parquet/properties.h Outdated

Comment thread cpp/src/parquet/bloom_filter.cc Outdated

Comment thread cpp/src/parquet/properties.h

fix review

a036535

mapleFU reviewed May 29, 2026

Comment thread cpp/src/parquet/bloom_filter_writer.cc Outdated

Comment thread cpp/src/parquet/bloom_filter.cc Outdated

fix review

a363cb5

pitrou reviewed Jun 2, 2026

mapleFU approved these changes Jun 2, 2026

mapleFU reviewed Jun 2, 2026

5 participants
